ABSTRACT

In this paper, we propose an efficient algorithm to train a
robust large-margin classifier when corrupt measurements
caused by sensor failure might be present in the training set.
By incorporating a non-parametric prior based on the empirical
distribution of the training data, we propose a
Geometric-Entropy-Minimization regularized Maximum Entropy
Discrimination (GEM-MED) method to perform classification
and anomaly detection in a joint manner. We demonstrate
that our proposed method can yield improved performance
over previous robust classification methods in terms of both
classification accuracy and anomaly detection rate using
simulated data and real footstep data.

Index Terms: corrupt measurements, robust large-margin
training, anomaly detection, maximum entropy discrimination

1.  INTRODUCTION 

Large margin classifiers, such as support vector machines
(SVMs), have demonstrated good classification performance
when the training data is representative of the test data
[1, 2, 3]. However, in many real-world applications training
data can suffer from corrupted measurements due to sensor
failure. In such cases, unless one accounts for possible
corruption of the training data, the performance of the classifier
degrades significantly. This paper presents a new and effective
approach to train classifiers with corrupted data.

There have been several approaches [4, 5, 6] to train classifiers
in a manner that is robust to corrupted training data.
Among these approaches, one common strategy is to use
Ramp Loss [1, 7, 8], which explicitly limits the value of the
maximal loss. The drawback of these Ramp-Loss-based approaches
is that they do not provide a unified framework for
joint anomaly detection and classification, and they are not
capable of handling corrupted training samples.

The motivation behind our proposed algorithm is that corrupted
data can be detected in the training set by using
anomaly detection techniques [9] during the classifier training
process. Such techniques are expressly designed to detect
anomalies in order to attain the lowest possible false alarm
and miss probabilities. In keeping with the non-parametric
nature of the SVM classifier, we will focus on non-parametric
anomaly detection schemes. Examples include minimal volume
(MV) set anomaly detection [10, 11] and minimal entropy
set anomaly detection [12, 13]. Among them, Hero
et al. [12, 13] developed the Geometric Entropy Minimization
(GEM) principle that estimates the MV set based on the
k-nearest neighbor graph (k-NNG). The key contribution of
this paper is the incorporation of GEM anomaly detection into an
SVM classifier under a non-parametric corrupt-data model.

Our proposed Geometric-Entropy-Minimization regularized
Maximum Entropy Discrimination (GEM-MED) framework
integrates large-margin training with anomaly detection
using a Bayesian convex optimization framework. The Maximum
Entropy Discrimination (MED) approach proposed by
Jaakkola et al. [14] performs Bayesian large margin classification
using the maximum entropy principle. MED subsumes
SVM as a special case. In this paper, we impose an adaptive
large margin constraint for each of the sample instances. This
constraint uses a Bayesian prior that is based on the GEM
principle of anomaly detection. The resulting GEM-MED
classifier effectively reduces the impact of anomalous samples
on classification. We demonstrate superior performance
on simulated data and on a real data set, containing human
and human-leading-animal footsteps, collected in the field by
acoustic sensors [15, 16, 17].

The outline of the paper is as follows: In Section 2, the
MED framework is introduced. In Section 3, the GEM-MED
framework is presented and a solution is proposed using
variational inference. In Section 4, experimental results based on
synthetic data and real data are presented.

2.  LARGE-MARGIN  TRAINING  WITH  MED 

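Let the training data set be $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where
$x_n \in \mathbb{R}^p$, $y_n \in \{-1, 1\}$. Assume the predictive distribution is
log-linear, i.e. $\log p(y|x, \Theta) \propto F(y, x; \Theta) = \frac{1}{2} y (w^T x + b)$,
where $\Theta = (w, b)$ and $F(y, x; \Theta)$ is the linear discriminative
function parameterized by $\Theta$. Denote the prior distribution of
$\Theta$ as $p_0(\Theta)$. The goal of Maximum Entropy Discrimination
(MED) [14] is to learn a posterior distribution $p(\Theta|\mathcal{D})$
by solving an entropic regularized risk minimization problem
with prior $p_0(\Theta)$,

$$\min_{p(\Theta|\mathcal{D})} \sum_{n} \left[1 - E_{p(\Theta|\mathcal{D})}\{\Delta F(y_n, x_n; \Theta)\}\right]_+ + \mathrm{KL}\left(p(\Theta|\mathcal{D}) \,\|\, p_0(\Theta)\right), \quad (1)$$

where $[s]_+ = \max\{s, 0\}$. $\mathrm{KL}(p\|q)$ is the Kullback-Leibler
divergence from distribution $p$ to $q$, i.e.

$$\mathrm{KL}\left(p(\Theta|\mathcal{D}) \,\|\, p_0(\Theta)\right) = \int_{\Theta} p(\Theta|\mathcal{D}) \log\left(\frac{p(\Theta|\mathcal{D})}{p_0(\Theta)}\right) d\Theta,$$

and $\Delta F(y_n, x_n; \Theta) = F(y_n, x_n; \Theta) - F(y \neq y_n, x_n; \Theta) =
\log\left(\frac{p(y_n|x_n, \Theta)}{p(y \neq y_n|x_n, \Theta)}\right)$ is the log-odds that defines the classifier.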

The first term in (1) is a hinge-loss that captures the
large-margin principle underlying the MED prediction rule,

$$y^* = \arg\max_{y} E_{p(\Theta|\mathcal{D})}\left[F(y, x; \Theta)\right].$$
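Because $F$ is linear in $\Theta$, the expectation in the prediction rule depends only on the posterior mean of $(w, b)$. The following is a minimal Python sketch of the log-odds, the hinge function, and the resulting rule, assuming the posterior mean is already available. The names `discriminant`, `log_odds`, `hinge`, `med_predict`, `w_mean`, and `b_mean` are illustrative and do not come from the paper.

```python
def discriminant(y, x, w, b):
    """F(y, x; Theta) = 0.5 * y * (w^T x + b) for a label y in {-1, +1}."""
    return 0.5 * y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def log_odds(y, x, w, b):
    """Delta F = F(y, x) - F(-y, x), which simplifies to y * (w^T x + b)."""
    return discriminant(y, x, w, b) - discriminant(-y, x, w, b)


def hinge(s):
    """[s]_+ = max(s, 0)."""
    return max(s, 0.0)


def med_predict(x, w_mean, b_mean):
    """argmax over y of E[F(y, x; Theta)].

    F is linear in Theta, so the expectation equals F evaluated at the
    posterior mean (w_mean, b_mean), which is assumed to be given.
    """
    return max((-1, 1), key=lambda y: discriminant(y, x, w_mean, b_mean))
```

For a sample with $w^T x + b < 0$, `med_predict` returns $-1$, and the hinge term `hinge(1 - log_odds(y, x, w, b))` is zero only when the log-odds of the true label are at least 1.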

As in kernel SVM, we could extend the predictive distribution
to a log-nonlinear function by kernelization [18, 19].
Specifically, a kernelized version of MED (1) is implemented
by redefining the discriminative function as $F(y, x; w) =
\frac{1}{2} y w^T \phi(x)$, where $\phi$ is the prescribed feature map. Define
the kernel function $K : \mathbb{R}^p \times \mathbb{R}^p \mapsto \mathbb{R}$ that satisfies
$\langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$ for all $x_i, x_j \in \mathcal{D}$. If
we use a Gaussian Process [20] as the prior on $w$, i.e.
$p_0(w) = \mathcal{N}(w; 0, I)$, kernel MED is obtained by solving
(1) with $\Theta = w$. The kernel MED approach is adopted in
Sec. 4 but, for simplicity, we assume the log-linear model in
Sec. 3.

Since the hinge loss in (1) is applied to every sample instance
with equal weight, training the MED classifier can
be overly sensitive to anomalous samples. Therefore, it is
desirable to extend MED to account for the possible presence of
anomalous samples in the training set.

3. GEM REGULARIZED MED (GEM-MED)

3.1.  GEM  regularized  MED:  model  development 

We develop a model that explicitly accounts for the possible
presence of anomalous samples by introducing a latent
variable $\eta_n \in \{0, 1\}$ associated with each sample $x_n$, where
$\eta_n = 1$ means that $x_n$ is nominal (uncorrupted), and $\eta_n = 0$
means that $x_n$ is anomalous. Denote $\eta = [\eta_1, \ldots, \eta_N]^T$.

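The goal is to learn the posterior joint distribution
$p(\Theta, \eta \,|\, \mathcal{D})$, given the prior $p_0(\Theta, \eta) = p_0(\Theta) p_0(\eta) =
p_0(\Theta) \prod_n p_0(\eta_n)$. In analogy to the MED principle in (1),
we propose a regularized MED framework to achieve this
goal,

$$\min_{p(\Theta, \eta | \mathcal{D})} \sum_{n} E_{p(\Theta, \eta | \mathcal{D})}\left\{\eta_n \left[1 - \Delta F(y_n, x_n | \Theta)\right]_+\right\} + C\, \mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta) + \mathrm{KL}\left(p(\Theta, \eta \,|\, \mathcal{D}) \,\|\, p_0(\Theta) p_0(\eta)\right), \quad (2)$$

where $\mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta)$ is the regularizer on $\eta$ associated with
$p(\Theta, \eta \,|\, \mathcal{D})$, and $C > 0$ is the regularization parameter.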

Note that the first term in (2) couples the quality of data, $\eta_n$,
to the class prediction loss $[1 - \Delta F(y_n, x_n|\Theta)]_+$. Compared
with (1), this empirical risk is adaptive to the latent state $\eta$ of
the training samples, since a suspected anomalous instance
(i.e. $\eta_n = 0$) will have no impact on the risk function.
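The effect of the latent indicator on the empirical risk can be illustrated with a small Python sketch. It evaluates the $\eta$-weighted hinge risk for fixed margins and a fixed $\eta$ (no expectation over the posterior is taken here, which is a simplification); the function name `weighted_hinge_risk` and the toy numbers are illustrative only.

```python
def weighted_hinge_risk(margins, eta):
    """Sum of eta_n * [1 - margin_n]_+ over the training samples.

    margins[n] stands for Delta F(y_n, x_n | Theta) and eta[n] in {0, 1}
    flags the sample as nominal (1) or anomalous (0).
    """
    return sum(e * max(1.0 - m, 0.0) for m, e in zip(margins, eta))


margins = [2.0, 0.5, -30.0]   # the last sample is a gross outlier
all_nominal = weighted_hinge_risk(margins, [1, 1, 1])   # outlier dominates the risk
outlier_off = weighted_hinge_risk(margins, [1, 1, 0])   # outlier contributes nothing
```

With all samples treated as nominal, the single outlier accounts for almost the entire risk; setting its indicator to zero removes that contribution.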

The main contribution of this paper is the inclusion of
the regularization term $\mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta)$ into the MED framework.
This term is estimated via the GEM principle [12, 13] and
it can be interpreted equivalently as the log of an empirical
Bayesian prior on the quality indicator $\eta$. In contrast to a
data-independent prior $p_0(\eta)$, the proposed prior $\mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta)$
depends on the empirical probability distribution of the training
data $\mathcal{D}$. This is further discussed in Sec. 3.2.

The motivation for our approach is that the Ramp-Loss-based
robust classification methods [1, 7, 8, 5, 6] fail to capture
a crucial characteristic property of the anomalous data:
such data tends to lie in the tail (low probability region) of
the data set. As previous classification models do not explicitly
account for the tails of the empirical distribution of the
training data, one might expect significant classification
performance improvement by incorporating such tail information
via $\mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta)$.

3.2.  The  construction  of  regularizer  via  GEM  principle 

To construct the regularization term $\mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta)$ we follow
the Geometric Entropy Minimization (GEM) [12, 13]
principle. Specifically, GEM estimates $\Omega_\beta$, where $\Omega_\beta =
\arg\min_A \{H(A) : \int_A p(x)\,dx \geq 1 - \beta\}$ is the minimal-entropy
set of false alarm level $\beta$, and $H(A) = -\int_A \log p(x)\, p(x)\,dx$ is
the Shannon entropy of the density $p(x)$ over the region $A$.
Given $\Omega_\beta$, a sample $x_n$ is declared anomalous if $x_n \notin \Omega_\beta$.
The decision rule is the Uniformly Most Powerful Test at
level $\beta$ when the anomalies are drawn from an unknown mixture
of known nominal distribution $p(x)$ and uniform anomalous
distribution [12]. As the training data distribution $p(x)$ is
unknown, GEM approximates this decision rule by replacing
$\Omega_\beta$ with an empirical estimate $\hat{\Omega}_\beta$ using the empirical
distribution $\hat{p}(x_n)$. Since $\eta_n$ is the indicator function of the
event $x_n \in \hat{\Omega}_\beta$, GEM reduces to solving
$\eta^* = \arg\min_{\eta} \{-\frac{1}{N} \sum_n \eta_n \log(\hat{p}(x_n)) : \frac{1}{N} \sum_n \eta_n \geq 1 - \beta\}$.
To approximate $\eta^*$, as in [12, 13], a k-nearest neighbor graph
approximation to $\Omega_\beta$ is constructed from the data set.

In our model GEM is implemented by using a bipartite
K-point kNN graph (BP-kNNG) [13]. Specifically, we first
split the training set into two parts using the class labels.
The BP-kNNG anomaly detector of [13] is then implemented
on each part $X_c = \{x_n : y_n = c\}$, $c \in \{\pm 1\}$, independently.
This results in partitioning the data set $X_c$ into two
parts $\{X_{N,c}, X_{M,c}\}$. For each sample $x_n \in X_{N, y_n}$, its local
entropy is estimated via $-\log(\hat{p}(x_n)) = d \log(R_k(x_n)) -
\log\left(\frac{k}{M_c c_d}\right)$, where $R_k(x_n)$ is the sum of k-nearest neighbor
(kNN) distances from the target sample $x_n$ to its $M_c$ reference
samples in $X_{M,c}$; $d$ is the intrinsic dimension of $x_n$ and
$c_d$ is the volume of the unit ball in $d$ dimensions. We estimate
$-\log(\hat{p}(x_m))$ for $x_m \in X_{M,c}$ in a similar manner. Then
the set $\mathcal{E}_c = \{-\log(\hat{p}(x_n)) : y_n = c\}$ is arranged in
ascending order, and we denote the sum of the
first $s_\beta$ smallest entropy values as $S_c$. The threshold $s_\beta$ is
set using the heuristic $s_\beta = \arg\max_k \{|-\log(\hat{p}(x_{[k-1]})) +
\log(\hat{p}(x_{[k]}))|\}$, where $-\log(\hat{p}(x_{[k]}))$ denotes the k-th smallest
element in $\mathcal{E}_c$.
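The local entropy estimate and the largest-gap heuristic for the threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $R_k$ is taken as the sum of the $k$ smallest distances to the reference set, the constant term $\log(k / (M c_d))$ follows the usual kNN density estimate, and the helper names (`unit_ball_volume`, `knn_neg_log_density`, `gap_threshold`) are hypothetical.

```python
import math


def unit_ball_volume(d):
    """Volume c_d of the unit ball in d dimensions."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def knn_neg_log_density(x, reference, k, d):
    """Estimate -log p(x) = d * log(R_k(x)) - log(k / (M * c_d)).

    Assumptions: R_k is the sum of the k smallest distances from x to the
    reference samples, and x is not itself a reference sample.
    """
    dists = sorted(math.dist(x, r) for r in reference)
    r_k = sum(dists[:k])
    m = len(reference)
    return d * math.log(r_k) - math.log(k / (m * unit_ball_volume(d)))


def gap_threshold(entropies):
    """Largest-gap heuristic on the sorted entropy values.

    Returns (s, S): s is the 1-based index k maximizing the jump between
    the (k-1)-th and k-th smallest values, and S is the sum of the s
    smallest values.
    """
    ordered = sorted(entropies)
    gaps = [abs(ordered[i] - ordered[i - 1]) for i in range(1, len(ordered))]
    s = gaps.index(max(gaps)) + 2   # gaps[i] sits between orders i+1 and i+2
    return s, sum(ordered[:s])
```

For example, `gap_threshold([0.1, 0.2, 0.3, 5.0, 5.1])` places the threshold at the jump between the third and fourth sorted values.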
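Given $S_c$ for each class $c \in \{-1, +1\}$, the GEM regularization
term $\mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta)$ is defined as

$$\mathcal{R}_{p(\Theta, \eta | \mathcal{D})}(\eta) = \sum_{c \in \{-1, +1\}} E_{p(\Theta, \eta | \mathcal{D})}\left\{ -\sum_{n:\, y_n = c} \eta_n \cdot \frac{\log(\hat{p}(x_n))}{N} - S_c \right\}. \quad (3)$$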

This term will impose a penalty for $\eta_n = 1$ when $x_n \notin \Omega_\beta$,
since the value of $S_c$ will make the summand of (3) positive.

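Substituting (3) into (2), the GEM regularized MED
(GEM-MED) is obtained as

$$\min_{p(\Theta, \eta | \mathcal{D})} \sum_{n} E_{p(\Theta, \eta | \mathcal{D})}\left\{\eta_n \left[1 - \Delta F(y_n, x_n | \Theta)\right]_+\right\} + C \sum_{c \in \{-1, +1\}} E_{p(\Theta, \eta | \mathcal{D})}\left\{ -\sum_{n:\, y_n = c} \eta_n \cdot \frac{\log(\hat{p}(x_n))}{N} - S_c \right\} + \mathrm{KL}\left(p(\Theta, \eta \,|\, \mathcal{D}) \,\|\, p_0(\Theta) p_0(\eta)\right), \quad (4)$$

where, as in (2), $C$ is a regularization parameter.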

3.3.  Solving  GEM-MED  via  variational  inference 

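Similar to the form of $p(\Theta|\mathcal{D})$ in MED, $p(\Theta, \eta|\mathcal{D})$ in GEM-MED
(4) takes the form

$$p(\Theta, \eta \,|\, \alpha, \mu, \mathcal{D}) = \frac{1}{Z(\alpha, \mu)}\, p_0(\Theta) \exp\left(-\sum_{n} \eta_n \alpha_n \{1 - \Delta F_n(\Theta)\}\right) \times p_0(\eta) \exp\left(-\sum_{c \in \{-1, +1\}} \mu_c \left\{ -\sum_{n:\, y_n = c} \eta_n \frac{\log(\hat{p}(x_n))}{N} - S_c \right\}\right)$$

$$\text{s.t. } 0 \leq \alpha \leq \mathbf{1}, \quad 0 \leq \mu \leq C\mathbf{1},$$

where $\mathbf{1} = [1, \ldots, 1]^T$ and $\Delta F_n(\Theta) = \Delta F(y_n, x_n|\Theta)$.
$Z(\alpha, \mu)$ is the partition function of the distribution, and $\alpha =
[\alpha_1, \ldots, \alpha_N]^T$ and $\mu = [\mu_{-1}, \mu_1]^T$ are dual variables associated
with the hinge loss and regularizer in (4), respectively.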
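If the posterior has the exponential form $p_0(\Theta) \exp(-\sum_n \eta_n \alpha_n \{1 - \Delta F_n(\Theta)\}) \, p_0(\eta) \exp(-\sum_c \mu_c \{-\sum_{n: y_n = c} \eta_n \log(\hat{p}(x_n))/N - S_c\})$, then for fixed $\Theta$ the indicators $\eta_n$ are conditionally independent Bernoulli variables. The Python sketch below computes the resulting conditional probability of $\eta_n = 1$ under that assumed form; the function name `nominal_probability` and its arguments are illustrative, not the paper's notation.

```python
import math


def nominal_probability(alpha_n, mu_c, delta_f, log_p_hat, n_total, prior=0.5):
    """p(eta_n = 1 | Theta, D) under the assumed exponential form.

    The log-odds of eta_n = 1 versus eta_n = 0 are
        log(prior / (1 - prior))
        - alpha_n * (1 - delta_f)
        + mu_c * log_p_hat / n_total,
    where delta_f is Delta F_n(Theta) and log_p_hat is log p_hat(x_n).
    """
    logit = (math.log(prior / (1.0 - prior))
             - alpha_n * (1.0 - delta_f)
             + mu_c * log_p_hat / n_total)
    return 1.0 / (1.0 + math.exp(-logit))
```

Under this form, a sample with a low estimated density (very negative `log_p_hat`) or a large margin violation receives a lower probability of being nominal, which is the behavior the anomaly score is meant to capture.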

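To identify the dual variables $\alpha$ and $\mu$, it is required to
solve for the minimum of the log-partition function

$$\max_{\alpha, \mu}\; -\log Z(\alpha, \mu), \qquad -\log Z(\alpha, \mu) = -\log \sum_{\eta \in \{0, 1\}^N} \int_{\Theta} \tilde{p}(\Theta, \eta \,|\, \alpha, \mu, \mathcal{D})\, d\Theta, \quad (5)$$

where $\tilde{p}$ denotes the unnormalized form of $p(\Theta, \eta|\alpha, \mu, \mathcal{D})$, i.e. without
the factor $1/Z(\alpha, \mu)$. As (5) is concave in $\alpha$ and $\mu$, we propose a projected
stochastic gradient descent algorithm (PSGD) [21] with Gibbs sampling.
The procedure is summarized in Algorithm 1. As
noted, we define $\pi_n = p(\eta_{n,T} = 1 | \Theta_T, \mathcal{D})$ as the anomaly
score for each sample, where $T$ is the final iteration of the
PSGD algorithm.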
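The projection used in a projected gradient scheme with box constraints is simple clipping. The Python sketch below shows one such update and runs it on a toy quadratic objective; the objective, the step size, and the names `project_box` and `projected_step` are assumptions made for illustration and stand in for the Gibbs-sampled gradient of the log-partition function.

```python
def project_box(value, upper):
    """Clip a scalar onto [0, upper]: min(max(value, 0), upper)."""
    return min(max(value, 0.0), upper)


def projected_step(params, grads, step, upper):
    """One projected gradient descent step on box-constrained variables."""
    return [project_box(p - step * g, upper) for p, g in zip(params, grads)]


# Toy objective (an assumption for illustration): f(a) = sum((a_i - t_i)^2),
# whose unconstrained minimizer t lies partly outside the box [0, 1].
target = [1.5, 0.3, -0.4]
alpha = [0.0, 0.0, 0.0]
for _ in range(200):
    grad = [2.0 * (a - t) for a, t in zip(alpha, target)]
    alpha = projected_step(alpha, grad, step=0.1, upper=1.0)
```

Coordinates whose unconstrained optimum lies outside the box end up on the boundary, while the interior coordinate converges to its unconstrained value.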

4.  EXPERIMENT 
4.1.  Simulated  data 

The  first  experiment  is  conducted  on  a  simulated  data  set. 
For each class $c \in \{\pm 1\}$ samples are generated from the
bivariate Gaussian distribution, with mean $m = (3, 3)$ and
covariance $\Sigma = \begin{bmatrix} 20 & 16 \\ 16 & 20 \end{bmatrix}$.

Here $\Theta = (w, b)$ has Gaussian prior in product form $p_0(\Theta) =
\mathcal{N}(w; 0, I)\,\mathcal{N}(b; 0, \sigma^2)$. For each sample $n$, the prior on $\eta_n$ is
set to be equally likely: $p_0(\eta_n) = 1/2$.

The anomalies in the training set are drawn uniformly from
a ring with an inner radius of $R$ and outer radius $R + 1$, where


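Algorithm 1  GEM regularized MED

Input: $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^p$, $y_n \in \{\pm 1\}$. Prior
distributions $p_0(\Theta)$, $p_0(\eta_n)$ and the upper bound $S_c$, $c \in \{\pm 1\}$.

1: Initialize: Set $\mu_0 = 0$. $\alpha_0$ is set by applying MED on $\mathcal{D}$.

2: for $t = 1, \ldots, T$ or until convergence do

3:   Compute the gradient of the log-partition function w.r.t. $\alpha_t$ and
     $\mu_t$, respectively, as

     $$\frac{\partial \log Z(\alpha_t, \mu_t)}{\partial \alpha_n} = E_{p(\Theta, \eta | \alpha_t, \mu_t, \mathcal{D})}\left[\eta_n \{\Delta F(y_n, x_n; \Theta) - 1\}\right], \quad n = 1, \ldots, N,$$

     $$\frac{\partial \log Z(\alpha_t, \mu_t)}{\partial \mu_c} = E_{p(\Theta, \eta | \alpha_t, \mu_t, \mathcal{D})}\left[\sum_{n:\, y_n = c} \eta_n \frac{\log(\hat{p}(x_n))}{N} + S_c\right], \quad c \in \{-1, +1\},$$

     where the expectation is approximated via Gibbs sampling
     with the conditional density $p(\eta|\Theta, \mathcal{D}) = \prod_{n=1}^{N} p(\eta_n|\Theta, \mathcal{D})$ and
     $p(\Theta|\eta, \mathcal{D})$ computed explicitly.

4:   Update $\alpha_n$ and $\mu_c$ via projected gradient descent, i.e.

     $$\alpha_{n,(t+1)} = \mathrm{proj}_{\{\alpha:\, 0 \leq \alpha \leq 1\}}\left\{\alpha_{n,t} - \varphi\, \frac{\partial \log Z(\alpha_t, \mu_t)}{\partial \alpha_n}\right\}, \quad n = 1, \ldots, N,$$

     $$\mu_{c,(t+1)} = \mathrm{proj}_{\{\mu:\, 0 \leq \mu \leq C\}}\left\{\mu_{c,t} - \psi\, \frac{\partial \log Z(\alpha_t, \mu_t)}{\partial \mu_c}\right\}, \quad c \in \{-1, +1\},$$

     where $\mathrm{proj}_{\{w:\, 0 \leq w \leq C\}}\{w\} = \min(\max(w, 0), C)$ defines the
     projection of $w$ on the feasible set $\{w : 0 \leq w \leq C\}$ via
     clipping and $\varphi, \psi > 0$ define the learning rates.

5: end for

Output: Assign label for test sample $x$ as
$y^* = \arg\max_y \int p(y|x, \Theta)\, p(\Theta|\mathcal{D})\, d\Theta$,
where $p(\Theta|\mathcal{D}) = \sum_{\eta \in \{0,1\}^N} p(\Theta, \eta \,|\, \mathcal{D})$ is computed via marginalization.
Also obtain the posterior on $\eta$ at the final iteration of
step 4, $\{\pi_n = p(\eta_{n,T} = 1|\Theta_T, \mathcal{D})\}$, as the anomaly scores.
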

the value of $R$ indicates the noise level in the corrupted training
set [5]. We fix the size of the training set to be 100 for
each class, with the ratio of anomalous samples denoted as $r_a$. The
test set contains 2000 samples from each class.
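A corrupted training set of this kind can be simulated with the Python standard library alone. The sketch below is illustrative and rests on assumptions not stated in the text: the ring is centered at the origin, points are uniform with respect to area, and the class mean passed to `sample_gaussian` is a placeholder. The helper names `sample_ring` and `sample_gaussian` are hypothetical.

```python
import math
import random


def sample_ring(n, inner, rng):
    """Draw n points uniformly (in area) from the ring inner <= r <= inner + 1."""
    points = []
    for _ in range(n):
        # Sampling r^2 uniformly makes the density uniform over the ring area.
        r = math.sqrt(rng.uniform(inner ** 2, (inner + 1) ** 2))
        theta = rng.uniform(0.0, 2.0 * math.pi)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def sample_gaussian(n, mean, rng):
    """Draw n points from N(mean, [[20, 16], [16, 20]]) via its Cholesky factor."""
    l11 = math.sqrt(20.0)
    l21 = 16.0 / l11
    l22 = math.sqrt(20.0 - l21 ** 2)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return out


rng = random.Random(0)
n_train, r_a, noise_level = 100, 0.2, 15
n_anomalous = round(r_a * n_train)
# The class mean used here is a placeholder, not a value confirmed by the text.
nominal = sample_gaussian(n_train - n_anomalous, (3.0, 3.0), rng)
anomalous = sample_ring(n_anomalous, noise_level, rng)
train_class = nominal + anomalous
```

Increasing `noise_level` moves the anomalies farther from the nominal cloud, and increasing `r_a` replaces a larger share of the 100 training samples per class with ring samples.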

We first compare the classification accuracy of MED/SVM
using LibSVM [22], Robust-Outlier-Detection (ROD) [5] with
outlier parameter $\rho \in [0.01, 1]$, and GEM-MED, under
noise levels $R \in \{15, 35, 55, 75\}$ and corruption rates $r_a \in
\{0.2, 0.3, 0.4, 0.5\}$. All reported results are averaged over
50 runs. Fig. 1(a) shows the mean and standard deviation of
the test errors for these models versus the noise level $R$
(with $r_a = 0.2$), and Fig. 1(b) shows the test errors under different
corruption rates (with $R = 55$). For ROD, only
$\rho \in \{0.02, 0.2, 0.6\}$ are shown for simplicity; $\rho = 0.02$
is the best over all $\rho \in [0.01, 1]$. In both experiments, as the
noise level or the corruption rate increases, the training data
become less representative of the test data and the difference
between their distributions increases, which causes a significant
increase in test error for the MED/SVM method. While
both ROD and GEM-MED limit the maximal loss values during
training, and thus prevent over-fitting to anomalies, GEM-MED
outperforms ROD as it incorporates the nonparametric
prior $\mathcal{R}(\eta)$ that is adaptive to anomalies in the training set,
as opposed to ROD, which relies on the predefined tuning
parameter $\rho$.

We then evaluate the efficiency of anomaly detection for
various RODs and GEM-MED, under the fixed corruption
rate 0.2. The $\pi_n$'s in GEM-MED and RODs are used as
anomaly scores and are placed in ascending order. We compute
the precision and recall using this ordering, averaging
over 50 runs. Fig. 2(a) plots the precision versus recall curve
for various RODs and GEM-MED, with a snapshot of a typical
result illustrated in Fig. 2(b). As seen in both figures, the
anomaly score given by GEM-MED provides more relevant
information about the true anomalies, compared to that given
by RODs. This is due to the additional GEM-based regularizer
$\mathcal{R}(\eta)$ in GEM-MED, which captures the characteristic
of an anomalous sample based on the relative entropy of the
region in which it resides. GEM-MED thus has better performance
in terms of the efficiency and accuracy of anomaly
detection than does ROD.
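A precision-recall curve of this kind can be computed directly from the sorted scores. The Python sketch below assumes that a low score indicates a likely anomaly (since $\pi_n$ is the probability of being nominal) and that ground-truth anomaly flags are available, as they are for simulated data; the function name `precision_recall` and the toy scores are illustrative.

```python
def precision_recall(scores, is_anomaly):
    """Precision/recall after flagging the m lowest-scoring samples, m = 1..N.

    scores[n] plays the role of pi_n (a low score suggests an anomaly) and
    is_anomaly[n] is the ground-truth flag (1 for anomalous, 0 for nominal).
    """
    order = sorted(range(len(scores)), key=lambda n: scores[n])
    total = sum(is_anomaly)
    hits, curve = 0, []
    for m, n in enumerate(order, start=1):
        hits += is_anomaly[n]
        curve.append((hits / m, hits / total))
    return curve


curve = precision_recall([0.9, 0.1, 0.8, 0.3, 0.7], [0, 1, 0, 1, 0])
```

Each entry of `curve` is a (precision, recall) pair; in this toy case both true anomalies have the two lowest scores, so precision stays at 1 until recall reaches 1.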


[Figure 1 plots: (a) test error vs. noise level R; (b) test error vs. corrupt rate (%). Curves: SVM/MED, ROD-0.60, ROD-0.20, ROD-0.02, GEM-MED.]

Fig. 1: (a) Test error (%) vs. noise level $R$ (corruption rate = 0.2). (b) Test
error (%) vs. corruption rate ($R = 55$) on simulated data. GEM-MED
outperforms both MED/SVM and ROD for various $\rho$ in classification accuracy
when either the noise level or the corruption rate increases.


Fig. 2: (a) Recall-precision curve for GEM-MED and RODs on simulated data
(corruption rate = 0.2). (b) Illustration of anomaly score $\pi_n$ for GEM-MED
and ROD-0.2. GEM-MED surpasses ROD in anomaly detection.


4.2.  Footstep  classification  data  set 

We perform experiments on the ARL-Footstep multisensor data
set [17, 16, 23], where the task is to discriminate between human
footsteps and human-leading-animal footsteps. The footstep
data was collected via four well-synchronized acoustic
sensors (labeled as Sensor 1, 2, 3, 4) in a natural environment,
where the environmental noise and multiple sensor failures
corrupted the acoustic recordings. It involves 84 human subjects
and 66 human-animal subjects. We randomly select 25
subjects from each class as the training set, with the rest
designated as the test set. In the preprocessing step, footsteps are
detected, extracted and segmented before a 200-dimensional
vector of mel-frequency cepstral coefficients (MFCCs) is computed
for each segment. We then apply PCA to reduce the
dimensionality from 200 to 50, as in [17, 23]. For $D$ sensors,
the augmented feature of dimension $50D$ is constructed
via feature concatenation.

In these experiments, we apply kernel MED with the Gaussian
kernel $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$; the kernel
parameter $\gamma > 0$ is tuned via 5-fold cross-validation. A Gaussian
Process $\mathcal{N}(w; 0, I)$ is used as the prior on $w$. For each sample
$n$, the prior on $\eta_n$ is set to $p_0(\eta_n) = 1/2$.
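For reference, a Gaussian kernel and its Gram matrix can be computed as in the short Python sketch below. It assumes the usual squared Euclidean distance in the exponent; the names `gaussian_kernel` and `gram_matrix` are illustrative, and the value of `gamma` would in practice come from cross-validation.

```python
import math


def gaussian_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, gamma):
    """Kernel matrix with entries K(x_i, x_j) for all pairs of samples."""
    return [[gaussian_kernel(a, b, gamma) for b in samples] for a in samples]
```

The Gram matrix is symmetric with ones on the diagonal, and larger `gamma` makes the kernel more local, so distant samples contribute less to the discriminant.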

Table 1 shows the classification accuracy for four individual
sensors and the combination of all four sensors with kernel
MED, ROD for $\rho \in [0.01, 1]$, and GEM-MED, and Table 2
shows their respective anomaly detection accuracies.
For ROD only $\rho = 0.02$ and $\rho = 0.20$ are shown;
$\rho = 0.20$ is the best over all $\rho \in [0.01, 1]$. It is seen that the
GEM-MED method outperforms all of the ROD-$\rho$ algorithms
and also outperforms kernel MED in classification accuracy
for sensors 1, 2, and 4, and it demonstrates significant improvement
in detection for sensors 1, 3, and 4. Note that ROD-0.2 has higher
detection accuracy than GEM-MED for sensor 2, since many
anomalous samples in this sensor reside in the high density
region of the data set, which violates the sparse anomaly
assumption underlying GEM. For combined sensors, as most of
the anomalies based on the joint feature representation reside
in the high entropy region of the data set, GEM-MED is able
to successfully detect most of the anomalous samples.


Classification Accuracy (%), mean ± standard error

sensor no.    kernel MED     ROD-0.02      ROD-0.2       GEM-MED
1             71.1 ± 5.3     73.7 ± 3.7    76.0 ± 2.5    78.4 ± 3.3
2             62.3 ± 10.2    71.5 ± 7.3    76.5 ± 5.3    82.1 ± 3.1
3             60.0 ± 13.1    63.2 ± 5.4    67.6 ± 4.2    66.8 ± 4.5
4             58.4 ± 8.2     71.8 ± 7.2    73.2 ± 4.2    80.1 ± 3.1
1, 2, 3, 4    78.6 ± 5.1     79.2 ± 3.7    79.8 ± 2.5    84.0 ± 2.3

Table 1: Classification accuracy with different sensors, with the best
performance shown in bold.


Anomaly Detection Accuracy (%), mean ± standard error

sensor no.    ROD-0.02      ROD-0.2       GEM-MED
1             30.2 ± 1.3    59.0 ± 3.5    70.5 ± 1.3
2             23.5 ± 2.6    63.5 ± 2.8    63.4 ± 2.5
3             5.3 ± 1.4     48.1 ± 3.3    72.8 ± 1.5
4             22.8 ± 3.2    65.2 ± 4.2    88.1 ± 2.1

Table 2: Anomaly detection accuracy with different sensors, with the best
performance shown in bold.


5.  CONCLUSION 

In this paper we propose the GEM-MED algorithm that provides
a unified optimization framework for classification and
anomaly detection. We demonstrate its performance advantages
in terms of both classification accuracy and detection
rate on a simulated data set and a real footstep data set, as
compared to the anomaly-blind Ramp-Loss-based classification
method.